A fast food chain has seen rapid expansion over the past couple of years and is now trying to optimize its supply chain to ensure there are no ingredient shortages. To that end, its data science team has been tasked with building a model that predicts the output of each food processing farm over the next few years.
These predictions could further improve the efficiency of the current supply chain management systems. In this competition you are expected to build machine learning model(s) that predict the output of the food processing farms for the next year.
About the data: 5 datasets are provided in this competition, along with a sample submission file. The datasets are named as follows:
train_data.csv:
date: The timestamp at which the yield of the food processing farm was measured
farm_id: The farm identifier that recognizes the farm's food processing plant
ingredient_type: The type of ingredient being produced
yield: The yield of the plant in tonnes
farm_data.csv:
farm_id: The farm identifier that recognizes the farm's food processing plant
founding_year: The year when operations commenced at the farm and food processing plant
num_processing_plants: The number of processing plants present on the farm
farm_area: The area of the farm in square meters
farming_company: The company that owns the farm
deidentified_location: The location at which the farm is present
train_weather.csv:
Weather data is provided by timestamp for each location where farms are present.
time_stamp: The timestamp at which the weather was observed
deidentified_location: The location at which the farm is present
temp_obs: Observed daily temperature, in °C
cloudiness: The extent to which the sky is covered by clouds and overcast weather, measured in oktas on a scale of 0 to 9
wind_direction: The direction from which the wind originates, reported in degrees
dew_temp: The temperature to which the air has to be cooled in order to reach saturation
pressure_sea_level: Atmospheric pressure reduced to sea level, in millibars
precipitation: Condensation from atmospheric water vapor, including rain, in mm/day
wind_speed: Wind speed, caused by air moving from high to low pressure (usually due to temperature differences), measured in km/h
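The three tables link up on `farm_id` and on (timestamp, `deidentified_location`). A minimal sketch of how they join, using tiny hand-made frames (the values and ids here are placeholders, not real competition data):

```python
import pandas as pd

# toy stand-ins for train_data.csv, farm_data.csv, and train_weather.csv
train = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"]),
    "farm_id": ["fid_1", "fid_1"],
    "ingredient_type": ["ing_z", "ing_z"],
    "yield": [0.0, 10.0],
})
farms = pd.DataFrame({
    "farm_id": ["fid_1"],
    "farm_area": [499.26],
    "deidentified_location": ["location 959"],
})
weather = pd.DataFrame({
    "time_stamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"]),
    "deidentified_location": ["location 959", "location 959"],
    "temp_obs": [3.8, 3.7],
})

# align the weather key name, then join: farm metadata on farm_id,
# weather on (timestamp, location)
weather = weather.rename(columns={"time_stamp": "date"})
merged = (train.merge(farms, on="farm_id", how="left")
               .merge(weather, on=["date", "deidentified_location"], how="left"))
print(merged[["date", "farm_id", "yield", "temp_obs"]])
```

This is the same wide layout the notebook's `train_ing_z.csv` file already has pre-built.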
!pip install hierarchicalforecast statsforecast datasetsforecast
Successfully installed datasetsforecast-0.0.7 hierarchicalforecast-0.2.1 quadprog-0.1.11 statsforecast-1.4.0 xlrd-2.0.1
!pip install fancyimpute
Requirement already satisfied: fancyimpute in /opt/conda/lib/python3.7/site-packages (0.7.0)
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from statsforecast.core import StatsForecast  # fits multiple models per series
from statsforecast.models import (
    ADIDA,
    AutoARIMA,
    AutoETS,
    CrostonClassic,
    DynamicOptimizedTheta,
    ETS,
    HistoricAverage,
    HoltWinters,
    IMAPA,
    MSTL,
    Naive,
    RandomWalkWithDrift,
    SeasonalNaive,
    TSB,
)
from hierarchicalforecast.evaluation import HierarchicalEvaluation
from fancyimpute import IterativeImputer
import collections
from scipy.optimize import lsq_linear
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import mean_squared_error as mse
from tqdm.notebook import tqdm
tqdm.pandas()
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings("ignore")
%%time
df = pd.read_csv("/kaggle/input/ingred-z/train_ing_z.csv", na_values="?")
test = pd.read_csv("/kaggle/input/ingred-z/test_ing_z.csv", na_values="?")
#conversion of date into datetime format
df["date"]=pd.to_datetime(df["date"])
#test["date"]=pd.to_datetime(test["date"])
# helper to inspect a dataframe: head, shape, dtypes/null counts,
# descriptive statistics, and the percentage of missing values per column
def check(df):
    display(df.head())
    print("~" * 120)
    print(df.shape)
    print("~" * 120)
    display(df.info())
    print("~" * 120)
    display(df.describe().T)
    print("~" * 120)
    display(df.isnull().sum() / df.shape[0] * 100)
check(df)
| | date | farm_id | ingredient_type | yield | operations_commencing_year | num_processing_plants | farm_area | farming_company | deidentified_location | temp_obs | cloudiness | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | fid_87942 | ing_z | 0.000 | NaN | 8.0 | 499.2607 | Obery Farms | location 959 | 3.8 | NaN | 240.0 | 2.4 | 1021.0 | NaN | 3.1 |
| 1 | 2016-01-01 | fid_66870 | ing_z | 0.000 | 1953.0 | 10.0 | 5295.0063 | Obery Farms | location 959 | 3.8 | NaN | 240.0 | 2.4 | 1021.0 | NaN | 3.1 |
| 2 | 2016-01-01 | fid_66062 | ing_z | 96.978 | NaN | 10.0 | 2992.0340 | Obery Farms | location 959 | 3.8 | NaN | 240.0 | 2.4 | 1021.0 | NaN | 3.1 |
| 3 | 2016-01-01 | fid_75323 | ing_z | 19.597 | 1958.0 | 13.0 | 9334.9860 | Obery Farms | location 959 | 3.8 | NaN | 240.0 | 2.4 | 1021.0 | NaN | 3.1 |
| 4 | 2016-01-01 | fid_75397 | ing_z | 100.000 | 1958.0 | 17.0 | 12976.9700 | Obery Farms | location 959 | 3.8 | NaN | 240.0 | 2.4 | 1021.0 | NaN | 3.1 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(1290364, 16)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1290364 entries, 0 to 1290363
Data columns (total 16 columns):
 #   Column                      Non-Null Count    Dtype
---  ------                      --------------    -----
 0   date                        1290364 non-null  datetime64[ns]
 1   farm_id                     1290364 non-null  object
 2   ingredient_type             1290364 non-null  object
 3   yield                       1290364 non-null  float64
 4   operations_commencing_year  567002 non-null   float64
 5   num_processing_plants       227618 non-null   float64
 6   farm_area                   1290364 non-null  float64
 7   farming_company             1290364 non-null  object
 8   deidentified_location       1290364 non-null  object
 9   temp_obs                    1287352 non-null  float64
 10  cloudiness                  774518 non-null   float64
 11  wind_direction              1229269 non-null  float64
 12  dew_temp                    1287128 non-null  float64
 13  pressure_sea_level          1275501 non-null  float64
 14  precipitation               1102710 non-null  float64
 15  wind_speed                  1284879 non-null  float64
dtypes: datetime64[ns](1), float64(11), object(4)
memory usage: 157.5+ MB
None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| yield | 1290364.0 | 378.504239 | 2482.995084 | 0.0000 | 0.000 | 37.96495 | 232.878 | 160187.00 |
| operations_commencing_year | 567002.0 | 1971.043638 | 25.572731 | 1900.0000 | 1958.000 | 1968.00000 | 1989.000 | 2012.00 |
| num_processing_plants | 227618.0 | 9.006129 | 2.976445 | 5.0000 | 6.000 | 9.00000 | 10.000 | 17.00 |
| farm_area | 1290364.0 | 10553.765391 | 9361.274673 | 499.2607 | 4587.922 | 8019.66550 | 14087.532 | 67999.88 |
| temp_obs | 1287352.0 | 17.186578 | 11.589553 | -28.9000 | 8.900 | 17.40000 | 25.600 | 47.20 |
| cloudiness | 774518.0 | 1.199543 | 1.836257 | 0.0000 | 0.000 | 0.00000 | 2.000 | 9.00 |
| wind_direction | 1229269.0 | 175.621738 | 112.878704 | 0.0000 | 80.000 | 180.00000 | 270.000 | 360.00 |
| dew_temp | 1287128.0 | 4.689789 | 9.087098 | -35.0000 | -1.700 | 4.40000 | 11.700 | 26.10 |
| pressure_sea_level | 1275501.0 | 1014.443356 | 7.345206 | 973.5000 | 1009.500 | 1014.00000 | 1019.000 | 1046.00 |
| precipitation | 1102710.0 | 0.523462 | 4.855797 | -1.0000 | 0.000 | 0.00000 | 0.000 | 333.00 |
| wind_speed | 1284879.0 | 3.126595 | 2.062691 | 0.0000 | 2.100 | 3.10000 | 4.100 | 18.50 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
date                           0.000000
farm_id                        0.000000
ingredient_type                0.000000
yield                          0.000000
operations_commencing_year    56.058756
num_processing_plants         82.360171
farm_area                      0.000000
farming_company                0.000000
deidentified_location          0.000000
temp_obs                       0.233423
cloudiness                    39.976782
wind_direction                 4.734711
dew_temp                       0.250782
pressure_sea_level             1.151846
precipitation                 14.542718
wind_speed                     0.425074
dtype: float64
# drop columns with more than 20% missing values:
# operations_commencing_year, num_processing_plants, cloudiness
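The 20%-missing rule can also be applied programmatically instead of hard-coding column names. A small sketch on a toy frame (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "yield": [1.0, 2.0, 3.0, 4.0],                           # 0% missing
    "num_processing_plants": [np.nan, np.nan, np.nan, 9.0],  # 75% missing
    "cloudiness": [np.nan, np.nan, 1.0, 2.0],                # 50% missing
    "temp_obs": [3.8, 3.7, 2.6, 2.0],                        # 0% missing
})

# same formula as in check(): percentage of nulls per column
missing_pct = toy.isnull().sum() / toy.shape[0] * 100
to_drop = missing_pct[missing_pct > 20].index.tolist()
print(to_drop)  # ['num_processing_plants', 'cloudiness']
```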
def del_col(df):
    todel = ["operations_commencing_year", "num_processing_plants",
             "ingredient_type", "cloudiness", "farming_company"]
    df.drop(todel, axis=1, inplace=True)  # inplace drop; no reassignment needed
del_col(df)
#del_col(test)
ingredient_type and farming_company are dropped as well: ingredient_type is constant (zero variance), and farming_company does not add predictive value, it is only useful for insights. Moreover, in the hierarchical ordering, companies have overlapping farms, which makes the data messy to interpret by company. Hence both columns are removed.
# encode deidentified_location with LabelEncoder
lr=LabelEncoder()
df["deidentified_location"]=lr.fit_transform(df["deidentified_location"])
# aggregation rules per (date, farm_id) group: sum yields and areas,
# average the weather columns, take the modal location
agg_dict_train = {
    "yield": pd.NamedAgg(column='yield', aggfunc=np.sum),
    "farm_area": pd.NamedAgg(column='farm_area', aggfunc=np.sum),
    "temp_obs": pd.NamedAgg(column='temp_obs', aggfunc=np.mean),
    "wind_direction": pd.NamedAgg(column='wind_direction', aggfunc=np.mean),
    "dew_temp": pd.NamedAgg(column='dew_temp', aggfunc=np.mean),
    "wind_speed": pd.NamedAgg(column='wind_speed', aggfunc=np.mean),
    "pressure_sea_level": pd.NamedAgg(column='pressure_sea_level', aggfunc=np.mean),
    "precipitation": pd.NamedAgg(column='precipitation', aggfunc=np.mean),
    "deidentified_location": pd.NamedAgg(column='deidentified_location', aggfunc=pd.Series.mode),
}
agg_dict_test = {
    "farm_area": pd.NamedAgg(column='farm_area', aggfunc=np.sum),
    "temp_obs": pd.NamedAgg(column='temp_obs', aggfunc=np.mean),
    "wind_direction": pd.NamedAgg(column='wind_direction', aggfunc=np.mean),
    "dew_temp": pd.NamedAgg(column='dew_temp', aggfunc=np.mean),
    "wind_speed": pd.NamedAgg(column='wind_speed', aggfunc=np.mean),
    "pressure_sea_level": pd.NamedAgg(column='pressure_sea_level', aggfunc=np.mean),
    "precipitation": pd.NamedAgg(column='precipitation', aggfunc=np.mean),
    "deidentified_location": pd.NamedAgg(column='deidentified_location', aggfunc=pd.Series.mode),
}
#Grouping train_data
data=df.groupby(["date","farm_id"]).agg(**agg_dict_train)
data=data.reset_index(["farm_id"]).sort_values(by=["farm_id"])
#Grouping test_data
#test=test.groupby(["date","farm_id"]).agg(**agg_dict_test)
#test=test.reset_index(["farm_id"]).sort_values(by=["farm_id"])
The data should contain 24 hours × 366 days × 145 farms = 1,273,680 rows, but the grouped data is smaller: the per-farm series have gaps and need reindexing onto a regular hourly grid.
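A quick sanity check of that arithmetic (2016 is a leap year, hence 366 days):

```python
# expected rows on a complete hourly grid for all farms
hours_per_day, days_in_2016, n_farms = 24, 366, 145
expected_rows = hours_per_day * days_in_2016 * n_farms
print(expected_rows)  # 1273680
```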
# create hourly time ranges to reindex the original data
train_seq = pd.date_range(start="2016-01-01 00:00:00", end="2016-12-31 23:00:00", freq="H")
test_seq = pd.date_range(start="2017-01-01 00:00:00", end="2017-12-31 23:00:00", freq="H")
# padding the train series
hrs = []
for i in tqdm(df["farm_id"].unique()):        # iterate over unique farm ids
    x = data[data["farm_id"] == i].reindex(train_seq)  # reindex each farm to a regular hourly index
    hrs.append(x)
main_train = pd.concat(hrs)
main_train
main_train
| | farm_id | yield | farm_area | temp_obs | wind_direction | dew_temp | wind_speed | pressure_sea_level | precipitation | deidentified_location |
|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-01 00:00:00 | fid_87942 | 0.000 | 499.2607 | 3.8 | 240.0 | 2.4 | 3.1 | 1021.0 | NaN | 8 |
| 2016-01-01 01:00:00 | fid_87942 | 10.000 | 499.2607 | 3.7 | 230.0 | 2.4 | 2.6 | 1021.5 | NaN | 8 |
| 2016-01-01 02:00:00 | fid_87942 | 10.000 | 499.2607 | 2.6 | 0.0 | 1.9 | 0.0 | 1022.0 | NaN | 8 |
| 2016-01-01 03:00:00 | fid_87942 | 10.000 | 499.2607 | 2.0 | 170.0 | 1.2 | 1.5 | 1022.5 | NaN | 8 |
| 2016-01-01 04:00:00 | fid_87942 | 0.000 | 499.2607 | 2.3 | 110.0 | 1.8 | 1.5 | 1022.5 | NaN | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-31 19:00:00 | fid_74945 | 133.282 | 7561.3750 | -10.3 | 70.0 | -11.9 | 5.1 | 1007.0 | 2.0 | 7 |
| 2016-12-31 20:00:00 | fid_74945 | 131.019 | 7561.3750 | -9.9 | 70.0 | -11.5 | 5.7 | 1006.5 | 2.0 | 7 |
| 2016-12-31 21:00:00 | fid_74945 | 122.936 | 7561.3750 | -9.9 | 70.0 | -11.5 | 5.1 | 1006.0 | NaN | 7 |
| 2016-12-31 22:00:00 | fid_74945 | 125.765 | 7561.3750 | -9.8 | 60.0 | -11.1 | 5.1 | 1005.5 | 3.0 | 7 |
| 2016-12-31 23:00:00 | fid_74945 | 121.401 | 7561.3750 | -9.6 | 70.0 | -10.8 | 4.1 | 1004.5 | 5.0 | 7 |
1273680 rows × 10 columns
Now the data is nearly ready, but the missing values still need to be imputed.
# weekly (168-hour) rolling mean of yield for the train and test series
plt.figure(figsize=(15,8))
plt.plot(Y_train["ds"],Y_train["y"].rolling(168).mean(),"-")
plt.plot(Y_test["ds"],Y_test["y"].rolling(168).mean(),"-")
plt.legend(["train","test"])
plt.show()
# padding the test series
hrs_te = []
for i in tqdm(test["farm_id"].unique()):
    x = test[test["farm_id"] == i].reindex(test_seq)
    hrs_te.append(x)
main_test = pd.concat(hrs_te)
# fit a single imputer on the train weather columns and reuse it for test,
# so test is transformed with statistics learned from train only
imputer = IterativeImputer()
to_fill = ["temp_obs", "wind_direction",
           "dew_temp", "pressure_sea_level", "precipitation", "wind_speed"]

def train_imputation(main):
    main[to_fill] = imputer.fit_transform(main[to_fill])
    main["farm_id"] = main["farm_id"].fillna(method="bfill")    # backfill ids within each padded block
    main["farm_area"] = main["farm_area"].fillna(method="bfill")
    main["yield"] = main["yield"].fillna(0)  # a missing date means no yield was produced
    return main

def test_imputation(main):
    main[to_fill] = imputer.transform(main[to_fill])  # transform only: imputer was fitted on train
    main["farm_id"] = main["farm_id"].fillna(method="bfill")
    main["farm_area"] = main["farm_area"].fillna(method="bfill")
    return main
train_imputation(main_train)
#test_imputation(main_test)
| | farm_id | yield | farm_area | temp_obs | wind_direction | dew_temp | wind_speed | pressure_sea_level | precipitation | deidentified_location |
|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-01 00:00:00 | fid_87942 | 0.000 | 499.2607 | 3.8 | 240.0 | 2.4 | 3.1 | 1021.0 | 0.839043 | 8 |
| 2016-01-01 01:00:00 | fid_87942 | 10.000 | 499.2607 | 3.7 | 230.0 | 2.4 | 2.6 | 1021.5 | 0.754091 | 8 |
| 2016-01-01 02:00:00 | fid_87942 | 10.000 | 499.2607 | 2.6 | 0.0 | 1.9 | 0.0 | 1022.0 | 0.835083 | 8 |
| 2016-01-01 03:00:00 | fid_87942 | 10.000 | 499.2607 | 2.0 | 170.0 | 1.2 | 1.5 | 1022.5 | 0.665092 | 8 |
| 2016-01-01 04:00:00 | fid_87942 | 0.000 | 499.2607 | 2.3 | 110.0 | 1.8 | 1.5 | 1022.5 | 0.810089 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-31 19:00:00 | fid_74945 | 133.282 | 7561.3750 | -10.3 | 70.0 | -11.9 | 5.1 | 1007.0 | 2.000000 | 7 |
| 2016-12-31 20:00:00 | fid_74945 | 131.019 | 7561.3750 | -9.9 | 70.0 | -11.5 | 5.7 | 1006.5 | 2.000000 | 7 |
| 2016-12-31 21:00:00 | fid_74945 | 122.936 | 7561.3750 | -9.9 | 70.0 | -11.5 | 5.1 | 1006.0 | 2.590092 | 7 |
| 2016-12-31 22:00:00 | fid_74945 | 125.765 | 7561.3750 | -9.8 | 60.0 | -11.1 | 5.1 | 1005.5 | 3.000000 | 7 |
| 2016-12-31 23:00:00 | fid_74945 | 121.401 | 7561.3750 | -9.6 | 70.0 | -10.8 | 4.1 | 1004.5 | 5.000000 | 7 |
1273680 rows × 10 columns
Iterative Imputer is a machine learning technique used for imputing missing values in a dataset. It works by filling in the missing values with predicted values that are generated by iterating over the features of the dataset.
The iterative imputer algorithm uses a regression model to predict the missing values for each feature, and then repeats this process until the missing values converge to a stable solution. The algorithm is based on the idea that the missing values in a dataset are not completely random, but rather are correlated with other features in the dataset.
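A toy demonstration of this idea, using scikit-learn's `IterativeImputer` (the notebook uses the fancyimpute version, which exposes the same `fit_transform` interface):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# second feature is roughly 2x the first, with one value missing
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])
X_filled = IterativeImputer(random_state=0).fit_transform(X)
# the missing entry is predicted by regressing the second feature on the
# first, so it lands close to 6.0 rather than at the column mean
```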

main_train.reset_index(inplace=True) #reset the index of the data
#main_test.reset_index(inplace=True)
main_train.rename(columns={"farm_id":"unique_id","index":"ds","yield":"y"},inplace=True)
#main_test.rename(columns={"farm_id":"unique_id","index":"ds"},inplace=True)
y=main_train[["ds","unique_id","y"]]
The columns are renamed to unique_id, ds, and y because that is the schema StatsForecast expects.
StatsForecast.plot(y)
x=main_train[["unique_id","ds","farm_area","temp_obs",
"wind_direction","dew_temp","pressure_sea_level",
"precipitation","wind_speed"]]
y=y.sample(frac=1).sort_values(by="ds")
x=x.sample(frac=1).sort_values(by="ds")
x['unique_id'] = x.unique_id.astype(str)
StatsForecast.plot(x)
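With y in the (unique_id, ds, y) layout, any of the StatsForecast models imported earlier can be fitted. As a library-free illustration of the simplest baseline among them, SeasonalNaive just repeats the last full seasonal cycle; here sketched with a hypothetical 24-hour season on a toy series:

```python
import numpy as np
import pandas as pd

season, h = 24, 24  # hourly data with a daily season; forecast one day ahead
rng = pd.date_range("2016-01-01", periods=48, freq="H")
y_toy = pd.Series(np.arange(48, dtype=float), index=rng)

last_cycle = y_toy.iloc[-season:].to_numpy()  # most recent 24 observations
forecast = last_cycle[np.arange(h) % season]  # repeat them over the horizon
print(forecast[:3])  # [24. 25. 26.]
```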